Goto

Collaborating Authors

 proposal candidate






Patherea: Cell Detection and Classification for the 2020s

Štepec, Dejan, Jerše, Maja, Đokić, Snežana, Jeruc, Jera, Zidar, Nina, Skočaj, Danijel

arXiv.org Artificial Intelligence

This paper presents a Patherea, a framework for point-based cell detection and classification that provides a complete solution for developing and evaluating state-of-the-art approaches. We introduce a large-scale dataset collected to directly replicate a clinical workflow for Ki-67 proliferation index estimation and use it to develop an efficient point-based approach that directly predicts point-based predictions, without the need for intermediate representations. The proposed approach effectively utilizes point proposal candidates with the hybrid Hungarian matching strategy and a flexible architecture that enables the usage of various backbones and (pre)training strategies. We report state-of-the-art results on existing public datasets - Lizard, BRCA-M2C, BCData, and the newly proposed Patherea dataset. We show that the performance on existing public datasets is saturated and that the newly proposed Patherea dataset represents a significantly harder challenge for the recently proposed approaches. We also demonstrate the effectiveness of recently proposed pathology foundational models that our proposed approach can natively utilize and benefit from. We also revisit the evaluation protocol that is used in the broader field of cell detection and classification and identify the erroneous calculation of performance metrics. Patherea provides a benchmarking utility that addresses the identified issues and enables a fair comparison of different approaches. The dataset and the code will be publicly released upon acceptance.


Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment

Xu, Xiaoxu, Yuan, Yitian, Zhang, Qiudan, Wu, Wenhui, Jie, Zequn, Ma, Lin, Wang, Xu

arXiv.org Artificial Intelligence

Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose \textbf{3D-VLA}, a weakly supervised approach for \textbf{3D} visual grounding based on \textbf{V}isual \textbf{L}inguistic \textbf{A}lignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.


The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions

Zhang, Hao, Sun, Aixin, Jing, Wei, Zhou, Joey Tianyi

arXiv.org Artificial Intelligence

Temporal sentence grounding in videos (TSGV), a.k.a., natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey attempts to provide a summary of fundamental concepts in TSGV and current research status, as well as future research directions. As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment. Then we review the techniques for multimodal understanding and interaction, which is the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate methods in different categories with their strengths and weaknesses. Lastly, we discuss issues with the current TSGV research and share our insights about promising research directions.


Sequential Context Encoding for Duplicate Removal

Qi, Lu, Liu, Shu, Shi, Jianping, Jia, Jiaya

Neural Information Processing Systems

Duplicate removal is a critical step to accomplish a reasonable amount of predictions in prevalent proposal-based object detection frameworks. In this work, we design a new two-stage framework to effectively select the appropriate proposal candidate for each object. The first stage suppresses most of easy negative object proposals, while the second stage selects true positives in the reduced proposal set. These two stages share the same network structure, an encoder and a decoder formed as recurrent neural networks (RNN) with global attention and context gate. The encoder scans proposal candidates in a sequential manner to capture the global context information, which is then fed to the decoder to extract optimal proposals.


Sequential Context Encoding for Duplicate Removal

Qi, Lu, Liu, Shu, Shi, Jianping, Jia, Jiaya

Neural Information Processing Systems

Duplicate removal is a critical step to accomplish a reasonable amount of predictions in prevalent proposal-based object detection frameworks. Albeit simple and effective, most previous algorithms utilize a greedy process without making sufficient use of properties of input data. In this work, we design a new two-stage framework to effectively select the appropriate proposal candidate for each object. The first stage suppresses most of easy negative object proposals, while the second stage selects true positives in the reduced proposal set. These two stages share the same network structure, i.e., an encoder and a decoder formed as recurrent neural networks (RNN) with global attention and context gate. The encoder scans proposal candidates in a sequential manner to capture the global context information, which is then fed to the decoder to extract optimal proposals. In our extensive experiments, the proposed method outperforms other alternatives by a large margin.


Sequential Context Encoding for Duplicate Removal

Qi, Lu, Liu, Shu, Shi, Jianping, Jia, Jiaya

Neural Information Processing Systems

Duplicate removal is a critical step to accomplish a reasonable amount of predictions inprevalent proposal-based object detection frameworks. Albeit simple and effective, most previous algorithms utilize a greedy process without making sufficient useof properties of input data. In this work, we design a new two-stage framework to effectively select the appropriate proposal candidate for each object. Thefirst stage suppresses most of easy negative object proposals, while the second stage selects true positives in the reduced proposal set. These two stages share the same network structure, i.e., an encoder and a decoder formed as recurrent neuralnetworks (RNN) with global attention and context gate. The encoder scans proposal candidates in a sequential manner to capture the global context information, whichis then fed to the decoder to extract optimal proposals. In our extensive experiments, the proposed method outperforms other alternatives by a large margin.